https://doi.org/10.1016/j.ascom.2021.100483

Summary

  • GPs assuming Gaussian noise suffer bias from outliers.
  • “Robust” formulations are often not analytically tractable.
  • Propose a method for iteratively trimming extreme observations (ITGP).
  • Two synthetic data studies.
  • Outperforms existing “robust” methods at the cost of longer execution time.
  • Targets model fit rather than parameter inference.
  • Use with caution!

GP Notation

For data \(\{(\boldsymbol{x}_i, y_i); i = 1, \ldots, N\}\) with \(\boldsymbol{x}_i \in \mathbb{R}^d\) and \(y_i \in \mathbb{R}\),

\[y_i = f(\boldsymbol{x}_i) + \varepsilon_i\]

We model the underlying function as a Gaussian process,

\[f(\cdot) \sim \mathcal{GP} \left( \mu(\boldsymbol{x};\boldsymbol\theta_\mu), k(\boldsymbol{x}, \boldsymbol{x'}; \boldsymbol{\theta}_k) \right)\]

where \(\mu(\cdot)\) is the mean function and \(k(\cdot)\) is the covariance kernel function, with hyperparameters \(\boldsymbol\theta_\mu\) and \(\boldsymbol\theta_k\), respectively.

“Standard GP”

Assumes observational noise is Gaussian distributed: \[\varepsilon \sim \mathcal{N}(0, \sigma^2_\textrm{n})\]

which makes \(\boldsymbol{y}\) also Gaussian distributed and analytically tractable.
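As a concrete illustration (mine, not the paper's code), a standard GP with Gaussian noise can be fit with scikit-learn; the `WhiteKernel` term plays the role of \(\sigma^2_\textrm{n}\), and the toy data here are placeholders:

```python
# Minimal standard-GP fit (illustrative sketch, not the paper's code).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=50)

# RBF covariance kernel plus a WhiteKernel acting as the Gaussian noise term.
kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel).fit(X, y)
mean, std = gp.predict(X, return_std=True)
```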

Maximise the Likelihood

If the noise is assumed Gaussian, the kernel hyperparameters are found by maximising the marginal likelihood:

\[\mathrm p(\boldsymbol Y|\boldsymbol X, \boldsymbol\theta) = \int \mathrm p(\boldsymbol{Y} | f, \boldsymbol{X})\cdot \mathrm p(f|\boldsymbol X, \boldsymbol\theta)\, \mathrm{d}f \]

\[\begin{multline} \log \mathrm p(\boldsymbol{Y} | \boldsymbol{X}, \boldsymbol{\theta}) = -\frac{1}{2}(\boldsymbol{Y} - \boldsymbol{\mu})^\top (\boldsymbol{K} + \sigma^2_\textrm{n} \boldsymbol{I})^{-1}(\boldsymbol Y - \boldsymbol\mu) \\- \frac{1}{2}\log(|\boldsymbol{K} + \sigma^2_\textrm{n}\boldsymbol I|) - \frac{N}{2}\log(2\pi) \end{multline}\]
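Written out in code, this expression is easy to evaluate directly (a sketch of mine using a Cholesky factorisation; \(\boldsymbol{K}\), \(\boldsymbol\mu\), and \(\sigma^2_\textrm{n}\) are taken as given):

```python
import numpy as np

def log_marginal_likelihood(y, mu, K, sigma_n2):
    """Evaluate log p(Y | X, theta) from the equation above."""
    n = len(y)
    C = K + sigma_n2 * np.eye(n)                          # K + sigma_n^2 I
    L = np.linalg.cholesky(C)                             # C = L L^T
    r = y - mu
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, r))   # C^{-1} (y - mu)
    return (-0.5 * r @ alpha                              # data-fit term
            - np.log(np.diag(L)).sum()                    # -(1/2) log|C|
            - 0.5 * n * np.log(2 * np.pi))
```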

But what if the noise is not Gaussian?

Robust Variants of GPs

  • Use heavier tailed distributions, e.g., Student-t, Laplace, Gaussian mixtures.

  • But these are usually not analytically tractable and tricky to implement.

  • And outliers still influence the model fit.

Proposal: trim extreme values according to some criterion and refit.

GP and Iterative Trimming (ITGP)

  • Premise: Trim a proportion of points with the largest absolute residuals after fitting a GP.

  • Parameters: \(\alpha_1\), \(\alpha_2\), \(n_\textrm{sh}\), and \(n_\textrm{cc}\).

  • Comprises three phases:

    1. Shrinking

    2. Concentrating

    3. Reweighting

Shrinking & Concentrating

  1. Train a standard GP on the full sample and compute the normalised residual of every point.

  2. Retrain the GP on the proportion \(\alpha\) of points with the smallest normalised residuals.

  3. Repeat Step 2 \(n_\textrm{sh}\) times while shrinking \(\alpha\) from 1 down to \(\alpha_1\).

  4. Repeat Step 2 a further \(n_\textrm{cc}\) times with \(\alpha = \alpha_1\) held constant.

Because the trimmed sample underestimates the true noise variance, an extra consistency correction factor (which depends on \(\alpha\)) is applied to the residuals; a code sketch of both phases follows.
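A minimal sketch of the shrinking and concentrating phases, built on scikit-learn rather than the authors' implementation. The linear shrinking schedule is one simple choice, and the consistency factor is the standard truncated-normal variance correction, standing in for the paper's own correction:

```python
# Sketch of ITGP's shrinking + concentrating phases (illustrative only).
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def fit_gp(X, y):
    kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
    return GaussianProcessRegressor(kernel=kernel).fit(X, y)

def consistency_factor(alpha):
    # Scale correction for keeping the fraction `alpha` of points with the
    # smallest |residuals| of a standard normal (stand-in for the paper's factor).
    q = norm.ppf(0.5 * (1 + alpha))
    return np.sqrt(1 - 2 * q * norm.pdf(q) / alpha)

def shrink_and_concentrate(X, y, alpha1=0.5, n_sh=2, n_cc=2):
    # Shrinking: alpha goes from 1 down to alpha1 over n_sh rounds;
    # concentrating: alpha stays at alpha1 for n_cc more rounds.
    alphas = list(np.linspace(1.0, alpha1, n_sh + 1)[1:]) + [alpha1] * n_cc
    keep = np.ones(len(y), dtype=bool)
    for alpha in alphas:
        gp = fit_gp(X[keep], y[keep])
        mean, std = gp.predict(X, return_std=True)
        r = np.abs(y - mean) / std / consistency_factor(alpha)  # corrected residuals
        keep = r <= np.quantile(r, alpha)    # keep the smallest alpha fraction
    return gp, keep
```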

Reweighting

  1. Exclude the points whose corrected normalised residuals (\(r_i\)) fall in the largest \(1 - \alpha_2\) fraction.

  2. Retrain GP.

Finally, identify outliers as those observations whose corrected normalised residuals satisfy \(r_i > 3\).
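Continuing the sketch above, the reweighting phase refits on the points that survive the \(\alpha_2\) cut and then flags outliers (again illustrative; the plumbing of the correction factor is simplified here):

```python
def reweight(X, y, keep, alpha2=0.975):
    # Refit on the concentrated sample, then drop the largest (1 - alpha2)
    # fraction of corrected residuals and refit one last time.
    gp = fit_gp(X[keep], y[keep])
    mean, std = gp.predict(X, return_std=True)
    r = np.abs(y - mean) / std / consistency_factor(alpha2)
    keep = r <= np.quantile(r, alpha2)
    gp = fit_gp(X[keep], y[keep])
    mean, std = gp.predict(X, return_std=True)
    r = np.abs(y - mean) / std
    return gp, r > 3        # final GP and the outlier flags (r_i > 3)
```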

Synthetic Neal (1997) Datasets

\[y = 0.3 + 0.4 x + 0.5 \sin(2.7 x) + \frac{1.1}{1+x^2} + \varepsilon\]
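One possible way to draw a contaminated sample from this function (the contamination rate and scheme below are my illustration, not the paper's exact setup):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100
x = rng.uniform(-3, 3, size=n)
f = 0.3 + 0.4 * x + 0.5 * np.sin(2.7 * x) + 1.1 / (1 + x**2)

y = f + rng.normal(0, 0.1, size=n)             # well-behaved Gaussian noise
out = rng.random(n) < 0.1                      # ~10% outliers (assumed rate)
y[out] += rng.normal(0, 1.0, size=out.sum())   # inflated-noise contamination
X = x[:, None]
```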

Results

[Figure: RMSE of the prediction residuals on each test set.]

Parameter Settings

Parameter            Suggested value
------------------   ---------------
\(\alpha_1\)         0.5
\(n_\textrm{sh}\)    2
\(n_\textrm{cc}\)    2
\(\alpha_2\)         0.975
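Plugged into the sketch from the ITGP section (the function names are from that sketch, not the paper's code), the suggested settings would be used like this:

```python
gp, keep = shrink_and_concentrate(X, y, alpha1=0.5, n_sh=2, n_cc=2)
gp, outliers = reweight(X, y, keep, alpha2=0.975)
print(f"flagged {outliers.sum()} of {len(y)} points as outliers")
```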

Synthetic Star Clusters (Li et al. 2020)

Results

Limitations

  • Biased by influential points near the boundary of the data coverage, where the local signal-to-noise ratio is low.

  • Computation time is 3 to 5 times that of a standard GP.

Practical Guidance

Comments on Paper

  • Very thorough simulation study, with an extended appendix on parameter selection.
  • The result is not too surprising, despite the many caveats.
  • Again, the GP mean function is ignored!
  • All based on point estimates rather than intervals or distributions.
  • The focus is on identifying outliers, not on explaining them.